Write checkable create & delete sla history events #566

Open
yhabteab wants to merge 7 commits into base: main


yhabteab
Member

@yhabteab yhabteab commented Feb 23, 2023

Why do we need this?

Currently, SLA history events are generated only when a checkable changes state or a downtime starts or ends. That is sufficient as long as a checkable is created once and never deleted. However, if you delete a host, recreate it a couple of days later, and generate an SLA report for it at the end of the week, the result varies depending on the state the host had before it was deleted. To make SLA reports as accurate as possible, we decided to track checkable creation and deletion times in addition to the existing information. Since Icinga 2 doesn't really know when an object was deleted (at least not in a simple way), this PR takes care of it.

Icinga DB doesn't know when an object was deleted either, though: it just takes the time at which the delete event for that object arrived and writes it into the new table. This means that if you delete or create checkables while Icinga DB is stopped, the events Icinga DB writes after it is started again won't reflect the actual deletion or creation time. There is no better way to handle this gracefully.

Config sync

The upgrade script for 1.3.0 generates a created_at SLA lifecycle entry for all existing hosts and services once it is applied, as proposed in #566 (comment). As a result, all special cases have been removed, such as the custom fingerprint type for services (see footnote 1) and the extra query that retrieved host IDs from the database for services deleted at runtime.

Implementation

The new table sla_history_lifecycle has a primary key over (id, delete_time), where delete_time=0 means "not deleted yet" (the column has to be NOT NULL because it is part of the primary key). id is the ID of the host or service for which the SLA lifecycle entry is generated. This ensures that there can be only one row per object stating that the object is currently alive in Icinga 2.
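To illustrate why the composite key allows at most one alive row per object, here is a minimal in-memory sketch. It is not the code from this PR: the names lifecycleKey, lifecycleTable and insert are hypothetical, and string IDs stand in for the real object IDs.

```go
package main

// lifecycleKey mirrors the composite primary key (id, delete_time).
// A DeleteTime of 0 means "not deleted yet".
type lifecycleKey struct {
	ID         string
	DeleteTime int64
}

// lifecycleTable is a hypothetical in-memory stand-in for the table,
// mapping each primary key to the row's create_time.
type lifecycleTable map[lifecycleKey]int64

// insert adds a row unless a row with the same primary key already exists.
// It reports whether the row was added. Because every alive row of an
// object shares the key (id, 0), a second alive row is always rejected.
func (t lifecycleTable) insert(id string, createTime, deleteTime int64) bool {
	key := lifecycleKey{ID: id, DeleteTime: deleteTime}
	if _, exists := t[key]; exists {
		return false
	}
	t[key] = createTime
	return true
}
```

Rows of already deleted objects carry distinct delete_time values, so any number of them can coexist with the single alive row.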

Initial sync

After each initial config dump, Icinga DB unconditionally runs a simple INSERT statement for the Host and Service types. The statement matches only hosts and services that don't already have an entry with delete_time = 0 in the sla_lifecycle table and sets their create_time to now. In addition, it sets the delete_time of all existing sla_lifecycle entries whose host/service IDs cannot be found in the Host/Service tables. It's unlikely, but if a given checkable doesn't have a create_time entry in the database, the UPDATE query won't change anything. Likewise, the INSERT statements become a no-op if the checkables already have a create_time entry with delete_time = 0.
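The reconciliation described above can be sketched in memory as follows. This is only an illustration under assumptions, not the SQL used by this PR: lifecycleRow and syncLifecycle are hypothetical names, string IDs stand in for the real object IDs, and the sketch closes only rows that are still alive (delete_time = 0).

```go
package main

// lifecycleRow is a hypothetical, simplified row of the lifecycle table.
type lifecycleRow struct {
	ID         string
	CreateTime int64
	DeleteTime int64 // 0 means "not deleted yet"
}

// syncLifecycle reconciles the existing rows with the IDs that are present
// after a config dump. Alive rows whose object has vanished are closed with
// delete_time = now, and objects without an alive row get a new one with
// create_time = now. All other rows are passed through unchanged.
func syncLifecycle(rows []lifecycleRow, currentIDs []string, now int64) []lifecycleRow {
	current := make(map[string]bool, len(currentIDs))
	for _, id := range currentIDs {
		current[id] = true
	}

	alive := make(map[string]bool)
	out := make([]lifecycleRow, 0, len(rows)+len(currentIDs))
	for _, row := range rows {
		if row.DeleteTime == 0 {
			if current[row.ID] {
				alive[row.ID] = true
			} else {
				// The object no longer exists: mark its alive row as deleted.
				row.DeleteTime = now
			}
		}
		out = append(out, row)
	}

	for _, id := range currentIDs {
		if !alive[id] {
			out = append(out, lifecycleRow{ID: id, CreateTime: now})
			alive[id] = true // guards against duplicate IDs in the input
		}
	}
	return out
}
```

Running the function a second time with the same set of IDs changes nothing, which matches the no-op behaviour described for the INSERT and UPDATE statements.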

Create

Nothing to be done here (all newly created objects will be covered by the bulk INSERT ... SELECT FROM host/service queries after the config dump).

Update

Nothing to be done here (object existed before and continues to exist).

Delete

Nothing to be done here (all deleted objects will be covered by the general bulk sla_lifecycle queries after the config dump).

Runtime updates

Upsert

Performs an INSERT that ignores duplicate keys for both create and update events (these look identical in the runtime update stream). If the object is already marked as alive in sla_history_lifecycle, this does nothing; otherwise, it marks the object as created now (this includes updates to objects that were created before this feature was enabled).

Delete

Assumes that a created_at sla_lifecycle entry exists for the checkable being deleted and performs a simple UPDATE statement setting delete_time = now (i.e. it updates the PK of the row), which marks the object's alive row as deleted. If, for whatever reason, there is no corresponding created_at entry for the checkable, the UPDATE becomes a no-op. This should not normally happen, as the upgrade script and/or the initial config dump should have generated the necessary entries for all objects that existed before this feature was available.
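The runtime upsert and delete handling can be sketched in memory like this. It is an illustration under assumptions rather than the queries from this PR: lifecycleKey, lifecycleTable, upsert and markDeleted are hypothetical names, and a map keyed by (id, delete_time) stands in for the table.

```go
package main

// lifecycleKey mirrors the composite primary key (id, delete_time).
type lifecycleKey struct {
	ID         string
	DeleteTime int64 // 0 means "not deleted yet"
}

// lifecycleTable maps each primary key to the row's create_time.
type lifecycleTable map[lifecycleKey]int64

// upsert models the INSERT that ignores duplicate keys: it creates an alive
// row with create_time = now unless the object already has one.
func (t lifecycleTable) upsert(id string, now int64) {
	key := lifecycleKey{ID: id}
	if _, alive := t[key]; !alive {
		t[key] = now
	}
}

// markDeleted models the UPDATE that sets delete_time = now on the alive
// row. It returns the number of affected rows, which is 0 (a no-op) when
// the object has no alive row.
func (t lifecycleTable) markDeleted(id string, now int64) int {
	key := lifecycleKey{ID: id}
	createTime, alive := t[key]
	if !alive {
		return 0
	}
	// Changing delete_time changes the primary key, so the row moves.
	delete(t, key)
	t[lifecycleKey{ID: id, DeleteTime: now}] = createTime
	return 1
}
```

For example, calling upsert twice for the same ID keeps the first create_time, and calling markDeleted for an ID without an alive row leaves the table untouched.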

Footnotes

  1. Before @julianbrost proposed a change in https://github.com/Icinga/icingadb/pull/566#issuecomment-2273088195, services had to implement an additional custom fingerprint type, ServiceFingerprint, which was also used to retrieve their host IDs when computing the config delta of the initial sync. This type eliminated the need to always perform an extra SELECT query to retrieve the host IDs, which the SLA lifecycles always require.

@cla-bot cla-bot bot added the cla/signed label Feb 23, 2023
@yhabteab yhabteab force-pushed the add-create-delete-history-events branch 3 times, most recently from 2853ab4 to 87e94ac Compare February 24, 2023 09:14
@yhabteab yhabteab requested a review from Al2Klimov February 24, 2023 09:41
@yhabteab yhabteab force-pushed the add-create-delete-history-events branch 2 times, most recently from 1cc1f09 to 80a76e5 Compare February 27, 2023 12:00
schema/mysql/schema.sql (outdated, resolved)
pkg/icingadb/sync.go (outdated, resolved)
pkg/icingadb/sla.go (outdated, resolved)
pkg/icingadb/sla.go (outdated, resolved)
pkg/icingadb/sla.go (outdated, resolved)
@yhabteab yhabteab force-pushed the add-create-delete-history-events branch 5 times, most recently from 2fc3663 to c160354 Compare March 2, 2023 12:54
@yhabteab yhabteab requested a review from Al2Klimov March 2, 2023 12:56
schema/mysql/schema.sql (outdated, resolved)
schema/mysql/schema.sql (outdated, resolved)
schema/pgsql/schema.sql (outdated, resolved)
schema/mysql/schema.sql (outdated, resolved)
pkg/icingadb/sla.go (outdated, resolved)
pkg/icingadb/sla.go (outdated, resolved)
pkg/icingadb/sla.go (outdated, resolved)
@yhabteab yhabteab force-pushed the add-create-delete-history-events branch 3 times, most recently from 2a4824a to 2717c61 Compare March 2, 2023 14:41
@yhabteab yhabteab requested a review from Al2Klimov March 2, 2023 14:42
@yhabteab yhabteab force-pushed the add-create-delete-history-events branch 2 times, most recently from 93d6f3f to 69bc4ad Compare March 2, 2023 16:37
@yhabteab yhabteab requested review from Al2Klimov and removed request for Al2Klimov March 2, 2023 16:39
@julianbrost
Contributor

I just had an idea for what we could call that type of SLA history, since we didn't really come up with a good name for it initially: lifecycle

@yhabteab yhabteab force-pushed the add-create-delete-history-events branch 4 times, most recently from 753dba4 to f1878aa Compare March 3, 2023 09:47
Member

@Al2Klimov Al2Klimov left a comment

Please don’t force-push for now.

pkg/types/int.go (outdated, resolved)
pkg/icingadb/db.go (outdated, resolved)
pkg/icingadb/db.go (outdated, resolved)
pkg/icingadb/sla.go (outdated, resolved)
pkg/icingadb/sla.go (outdated, resolved)
pkg/icingadb/sla.go (outdated, resolved)
pkg/icingadb/sla.go (outdated, resolved)
pkg/types/int.go (outdated, resolved)
pkg/icingadb/sla.go (outdated, resolved)
@yhabteab yhabteab force-pushed the add-create-delete-history-events branch 2 times, most recently from abdb06f to f8ec4c2 Compare September 4, 2024 07:00
@yhabteab
Member Author

yhabteab commented Sep 4, 2024

I've pushed the new transparent Screenshots and also changed the main function a bit, so I won't be pushing anything from now on unless someone requests a change.

@yhabteab yhabteab requested review from oxzi and removed request for Al2Klimov September 4, 2024 07:07
cmd/icingadb/main.go (outdated, resolved)
cmd/icingadb/main.go (outdated, resolved)
@yhabteab yhabteab force-pushed the add-create-delete-history-events branch 2 times, most recently from 507833d to 6776694 Compare September 5, 2024 14:00
@yhabteab yhabteab requested a review from oxzi September 5, 2024 14:03
@yhabteab yhabteab force-pushed the add-create-delete-history-events branch from 6776694 to 6ed6609 Compare October 24, 2024 11:27
@yhabteab yhabteab force-pushed the add-create-delete-history-events branch from 6ed6609 to 26195ed Compare October 25, 2024 07:11
@yhabteab yhabteab force-pushed the add-create-delete-history-events branch from 26195ed to 7aa02dd Compare October 25, 2024 07:14
@yhabteab yhabteab requested review from oxzi, julianbrost and lippserd and removed request for julianbrost, lippserd and oxzi October 25, 2024 07:14
the database without any interpretation. In order to generate and visualise SLA reports of specific hosts and services
based on the accumulated events over time, [Icinga Reporting](https://icinga.com/docs/icinga-reporting/latest/doc/02-Installation/)
is the optimal complement, facilitating comprehensive SLA report generation within a specific timeframe.

Member

I would change the tone of the first two paragraphs to make them sound more like a technical document and less like a PR post. For example, the "legally binding" part could be dropped, as that need not be the case. Furthermore, Icinga Reporting is (currently) not the "optimal complement", but the only complementing UI.

Please don't take this the wrong way. If I read this as a technical user (and this is technical documentation), I would get the feeling that somebody wants to sell me something, not that I am going to find the details here.

| PENDING (no check result yet) | OK |
| OK | Warning |
| Warning | Critical |
| Critical | OK |
Member

As discussed earlier, please include the magic numbers used for the states within the database. This eases debugging and understanding when this technical document is used alongside one's own setup.

// clearly identifies its state. Consequently, such events become irrelevant for the purposes of calculating
// the SLA and we must exclude the duration of that PENDING state from the total time.
total_time = total_time - (event.event_time - last_event_time)
else if (previous_hard_state is greater than OK/UP
Member

Without the mapping from states to their numeric representation, this is quite cryptic. Thus, either introduce a mapping before or make it more explicit, like listing all acceptable previous_hard_state values.

Member Author

I'm not sure I understand you correctly, but I don't quite agree that this is cryptic; if anything, it's the exact opposite. These are just normal Icinga 2 host and service states, which have their own documentation, except for the PENDING state. Apart from that, I have actually listed the possible values that the previous_hard_state column represents, and you want me to add another listing here?

Member

Please try to put yourself in a reader's position, who knows a bit about Icinga, but definitely no implementation details. When reading the comparison in this line, it may be unclear what "greater than" means in this context, as no numerical representation or hierarchy of states was introduced.

Apart from that, I have actually listed the possible values that the previous_hard_state column represents and you want me to add another listing here?

Based on this table, one would assume the following order: PENDING < OK < WARNING < CRITICAL.
And what about UNKNOWN and the host states UP and DOWN?

Coming back to the if cases, there is

  1. event.previous_hard_state is PENDING and
  2. previous_hard_state is greater than OK/UP AND previous_hard_state is not PENDING AND checkable is not in DOWNTIME (this one misses the event. prefix, but I guess it was just forgotten?).

Shouldn't it be possible to rewrite the second clause to event.previous_hard_state is not in {OK, UP} AND checkable is not in DOWNTIME as event.previous_hard_state cannot be PENDING due to the first check? Doing so eliminates any order which has to be explained otherwise.
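The suggested rewrite could look roughly like the following sketch. It relies on assumptions: hardState, branch and classify are hypothetical names, the states are modelled as plain names so that no numeric order is involved, and the effect of each branch on the SLA calculation is left out because the quoted pseudocode is truncated.

```go
package main

// hardState names a previous hard state without implying any order.
type hardState string

const (
	statePending  hardState = "PENDING"
	stateOK       hardState = "OK"
	stateWarning  hardState = "WARNING"
	stateCritical hardState = "CRITICAL"
	stateUnknown  hardState = "UNKNOWN"
	stateUp       hardState = "UP"
	stateDown     hardState = "DOWN"
)

// branch identifies which clause of the if/else chain applies to an event.
type branch int

const (
	branchPending branch = iota // first clause: previous state is PENDING
	branchProblem               // second clause: problem state outside a downtime
	branchNone                  // neither clause applies
)

// classify picks the clause using set membership instead of "greater than".
// PENDING is handled by the first case, so the second case does not need
// to exclude it again.
func classify(previous hardState, inDowntime bool) branch {
	switch {
	case previous == statePending:
		return branchPending
	case previous != stateOK && previous != stateUp && !inDowntime:
		return branchProblem
	default:
		return branchNone
	}
}
```

Written this way, UNKNOWN and DOWN fall into the second clause without any ordering of the states having to be documented.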

pkg/icingadb/sla_lifecycle.go (resolved)
@lippserd lippserd modified the milestones: 1.2.1, 1.3.0 Dec 5, 2024
Labels: cla/signed, enhancement
5 participants